August 28, 2016
A special kind of MNAR.
Up to 2/3 of observations and 11 columns have missingness
Missingness is controlled by jet_num and the predicted Higgs mass (mass_MMC)
A jet is the shotgun-like shower of decay products from quarks/gluons spraying out in a particular direction.
The number of jets determines the type of physics interaction, just as SVM, GLM, and NN are DIFFERENT models under the same umbrella term MACHINE LEARNING
Standard MAR imputation methodology is useless, or even harmful, for prediction
We impute the Higgs mass and the momentum-related variables to zero, AWAY from each variable's population range.
Extrapolate the other missingness, and view jet_num = 0 -> 1 -> 2 -> 3 as degeneration cases.
Tree learning handles categorical and continuous variables well, separating out the imputed missing values naturally. This becomes our top candidate!
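A minimal sketch of this out-of-range imputation, assuming missing entries arrive as NaN (the column names below are illustrative, not the full challenge schema):

```python
import numpy as np

# Toy mini-sample: mass_MMC and one momentum column, with NaN marking
# the MNAR entries.
mass_MMC = np.array([125.3, np.nan, 97.8, np.nan])
pt_lep = np.array([45.1, 38.7, np.nan, 52.0])

def impute_out_of_range(x, fill=0.0):
    """Replace missing entries with a constant that lies AWAY from the
    population range, so a tree split can isolate them cleanly."""
    x = x.copy()
    x[np.isnan(x)] = fill
    return x

mass_imp = impute_out_of_range(mass_MMC)
pt_imp = impute_out_of_range(pt_lep)
```

Because zero sits outside the observed range of these physics variables, a single tree split (x <= small epsilon) separates the imputed group from the rest.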
Simple, Fast & Interpretable
A working ML pipeline
A Performance Baseline
Label ~ Original dataset - EventId - Weight
18 significant variables
Training set AUC: 0.816
Test set AMS: 2.02
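The feature setup above (Label ~ everything minus EventId and Weight) can be sketched as follows; the toy frame only mirrors the training file's column layout, it is not the real data:

```python
import pandas as pd

# Hypothetical two-row frame with the same column roles as the
# challenge training file.
df = pd.DataFrame({
    "EventId": [100000, 100001],
    "DER_mass_MMC": [138.47, 160.94],
    "DER_mass_vis": [97.83, 103.24],
    "Weight": [0.0027, 2.2331],
    "Label": ["s", "b"],
})

# Label ~ everything except the identifier and the scoring weight.
X = df.drop(columns=["EventId", "Weight", "Label"])
y = (df["Label"] == "s").astype(int)
```

EventId is a row identifier and Weight is only used by the AMS scoring function, so neither belongs in the feature matrix.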
The same AUC = 1 pattern appears for the 19 variables with NO missingness, so this has nothing to do with the imputation methodology.
After the Higgs Mass (with missingness) is removed, the top two important variables are DER_mass_transverse_met_lep, DER_mass_vis
Running an RF on just these two variables still produces AUC = 1!
RF regression fits the pdfs by piecewise constant functions
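A one-split stump makes the piecewise-constant behavior concrete: each leaf predicts the mean target of the training points that fall in it (toy 1-D data, threshold chosen by hand):

```python
import numpy as np

# Toy 1-D regression sample.
x = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([1.0, 1.1, 0.9, 3.0, 3.2, 2.8])

t = 0.5  # a single split point, as a tree would choose
left_mean = y[x <= t].mean()   # constant prediction on the left
right_mean = y[x > t].mean()   # constant prediction on the right

def stump_predict(q):
    """Piecewise-constant fit: one flat value per region."""
    return left_mean if q <= t else right_mean
```

A full RF averages many such step functions, so its regression surface is still piecewise constant, just with finer steps.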
The noise between the test set and the training set is huge, throwing off the RF's class estimates.
Quite possibly the learning problem is too easy in sample, but RF cannot adapt to the train-test shift effectively.
XGBoost is able to treat missing values properly
Less sensitive to noise
eXtremely FAST
Low memory use
Automatic parallel processing
No feature engineering
AMS: 3.5
Too many hyper-parameters to be tuned
e.g. Cutoff threshold
Predicted probability > threshold ==> 'signal'
Predicted probability <= threshold ==> 'background'
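The cutoff can be tuned by sweeping thresholds against the challenge's AMS metric, AMS = sqrt(2*((s + b + b_reg) * ln(1 + s/(b + b_reg)) - s)) with regularizer b_reg = 10, where s and b are the weighted sums of true signal and background events called 'signal'. The probabilities and weights below are toy values:

```python
import numpy as np

def ams(s, b, b_reg=10.0):
    """Approximate Median Significance, the challenge metric."""
    return np.sqrt(2.0 * ((s + b + b_reg) * np.log(1.0 + s / (b + b_reg)) - s))

# Toy predictions: events above the cutoff are called 'signal'.
proba = np.array([0.9, 0.8, 0.6, 0.4, 0.2])
weights = np.array([1.5, 0.8, 1.2, 2.0, 0.5])
is_sig = np.array([True, True, False, True, False])

# Sweep candidate thresholds and keep the one maximizing AMS.
best_ams, best_t = max(
    (ams(weights[(proba > t) & is_sig].sum(),
         weights[(proba > t) & ~is_sig].sum()), t)
    for t in np.linspace(0.1, 0.9, 9)
)
```

In practice the sweep would run on a held-out set with the official event weights, since AMS is sensitive to how many events are admitted past the cutoff.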